all code available here

Data Characteristics and Decisions

Inclusion / Exclusion Criteria

  • Start date was selected to be 05/31/2021
  • End date was selected to be 12/31/2021
  • For some states, December data has not been fully entered
  • Dropped the US territories and Washington DC
  • NYC was reported separately from NY in the state dataset; the two were combined
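
The criteria above can be sketched as a small filter. This is a minimal sketch under stated assumptions, not the project's actual code: the row layout, the `NYC` code, the inclusive date window, and the list of excluded jurisdiction codes are all assumptions for illustration, and the sample rows are made up.

```python
from collections import defaultdict
from datetime import date

START, END = date(2021, 5, 31), date(2021, 12, 31)
# Assumed codes for the US territories and Washington DC; adjust to the dataset.
EXCLUDED = {"AS", "GU", "MP", "PR", "VI", "DC"}

def filter_and_combine(rows):
    """Apply the date window, drop excluded jurisdictions, and fold NYC into NY."""
    totals = defaultdict(int)
    for row in rows:
        if not (START <= row["sub_date"] <= END):
            continue  # outside the study period (window assumed inclusive)
        if row["state"] in EXCLUDED:
            continue
        state = "NY" if row["state"] == "NYC" else row["state"]
        totals[(state, row["sub_date"])] += row["new_case"]
    return dict(totals)

# Made-up rows, for illustration only.
rows = [
    {"state": "NY", "sub_date": date(2021, 6, 1), "new_case": 100},
    {"state": "NYC", "sub_date": date(2021, 6, 1), "new_case": 50},
    {"state": "PR", "sub_date": date(2021, 6, 1), "new_case": 30},
    {"state": "AK", "sub_date": date(2021, 5, 30), "new_case": 10},
]
combined = filter_and_combine(rows)
```

Summing NY and NYC per submission date is one way to combine them; it assumes both report on the same dates.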

Missing or Suppressed Data

  • Suppressed data is represented by NA, while data that was not obtained (not filled out on a form) is represented by ‘Missing’.
  • Data denoted as ‘Unknown’ was entered on the form as Unknown.
  • Cases from 5/31 are excluded from the generated proportions: county-level data specifies only the month, so cases from 5/31 are indistinguishable from the rest of May.
  • For county data, age group ‘Missing’ and ‘NA’ values were combined and ultimately dropped.
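
As a rough illustration of the last point, the sketch below treats NA (represented here as `None`) and ‘Missing’ age groups as a single category and drops them. The row layout and the helper name `drop_unrecorded_age` are hypothetical, and the rows are made up.

```python
def drop_unrecorded_age(rows):
    """Treat None (NA) and 'Missing' age groups as one category and drop them.

    'Unknown' is left in place, since only 'Missing' and NA are described
    as being dropped.
    """
    kept = [r for r in rows if r["age_group"] not in (None, "Missing")]
    dropped = len(rows) - len(kept)
    return kept, dropped

# Made-up rows, for illustration only.
rows = [
    {"state": "AK", "month": 6, "age_group": "0 - 17 years"},
    {"state": "AK", "month": 6, "age_group": "Missing"},
    {"state": "AK", "month": 6, "age_group": None},
    {"state": "AK", "month": 6, "age_group": "65+ years"},
]
kept, dropped = drop_unrecorded_age(rows)
```

Returning the dropped count alongside the kept rows makes it easy to report how much data the exclusion removes.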

Evaluating Missingness

County Data

A negligible number of county-level cases have no recorded state.

Relatively few county-level cases have no documented age group.

Much of the death data is not documented at the county level. I think we should pull this information from a separate dataset.

State Data

  • Some states didn’t list conf_death and conf_cases (cumulative counts); these were left empty.
  • For analysis, these empty values were transformed to NA.
  • Only conf_death and conf_cases had NAs; the remaining variables (state, new_case, new_death, sub_date, time_interval) had complete information.
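
A minimal sketch of the empty-to-NA step and a per-column NA count follows. It assumes empty values arrive as empty strings and uses `None` to stand in for NA; the helper names and sample rows are hypothetical.

```python
def empty_to_na(rows, columns=("conf_cases", "conf_death")):
    """Return copies of the rows with empty strings in `columns` replaced by None."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the input rows are not modified
        for col in columns:
            if row.get(col) == "":
                row[col] = None
        cleaned.append(row)
    return cleaned

def na_counts(rows):
    """Count None values per column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (value is None)
    return counts

# Made-up rows with placeholder state codes, for illustration only.
rows = [
    {"state": "S1", "new_case": 10, "new_death": 0, "conf_cases": "", "conf_death": ""},
    {"state": "S2", "new_case": 5, "new_death": 1, "conf_cases": 5, "conf_death": 1},
]
counts = na_counts(empty_to_na(rows))
```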

Proportioning Cases

  • Variant proportion data is available for 2-week periods
  • The state-level data gives the number of cases for every 2-week period but is not age stratified
  • The county-level data is age stratified but only gives the number of cases on a monthly basis
  • We can apply the monthly distribution of cases per age group (from the county-level data) to the state-level data to apportion the cases to age groups. This assumes that the distribution of cases per age group is the same for the two 2-week periods that make up a month.
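
The apportioning step can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the function names are hypothetical, and the monthly county counts are illustrative values chosen so that the resulting proportions match the AK June rows in the Current Dataset table (the actual county counts are not shown in this report). The 2-week totals 409 and 625 are taken from that table.

```python
def age_distribution(monthly_counts):
    """Share of a month's county-level cases falling in each age group."""
    total = sum(monthly_counts.values())
    return {age: n / total for age, n in monthly_counts.items()}

def apportion(case_tot, distribution):
    """Split a 2-week state total across age groups using the monthly shares."""
    return {age: round(case_tot * p) for age, p in distribution.items()}

# Illustrative monthly counts (assumption, see lead-in).
june_counts = {
    "0 - 17 years": 68,
    "18 to 49 years": 227,
    "50 to 64 years": 73,
    "65+ years": 16,
}
dist = age_distribution(june_counts)

# Both 2-week periods in the month reuse the same monthly distribution.
first_period = apportion(409, dist)
second_period = apportion(625, dist)
```

Because each age group is rounded independently, the apportioned counts are not guaranteed to sum exactly to the 2-week total, although they do for these two periods.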

Distribution of NAs Over Study Period (County Level Data)

There are states where, for one or more months, age group is not available (suppressed or missing).
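
One way to list the affected state:month combinations is sketched below. The row layout, the helper name, and the placeholder state codes `S1` and `S2` are hypothetical; `None` stands in for NA and the rows are made up.

```python
def months_without_age(rows, months):
    """Return sorted (state, month) pairs that have no recorded age group."""
    states = {r["state"] for r in rows}
    covered = {
        (r["state"], r["month"])
        for r in rows
        if r["age_group"] not in (None, "Missing")
    }
    return sorted((s, m) for s in states for m in months if (s, m) not in covered)

# Made-up rows, for illustration only.
rows = [
    {"state": "S1", "month": 6, "age_group": "18 to 49 years"},
    {"state": "S1", "month": 7, "age_group": "65+ years"},
    {"state": "S2", "month": 6, "age_group": None},
    {"state": "S2", "month": 7, "age_group": "0 - 17 years"},
]
gaps = months_without_age(rows, months=range(6, 8))
```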

Case Age Group Distribution Over Time

[Figures not shown: case age group distribution over time for All States, Missouri, Oregon, Tennessee, Texas, California, and New York]

For the most part, the distribution of cases across age groups does not vary much over time. As a result, for state:month combinations that we don’t have county case data on, it may be acceptable to use the distribution from the month before or after.
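
A sketch of such a fallback is shown below. The lookup order (the month itself, then the month before, then the month after) is an assumption, as are the function name, the placeholder state code `S1`, and the made-up distributions. Month arithmetic is kept simple because the study period lies within a single calendar year.

```python
def distribution_with_fallback(dists, state, month):
    """Return (distribution, source_month) for a state:month combination.

    Tries the month itself, then the month before, then the month after
    (assumed order). Returns (None, None) if none is available.
    """
    for candidate in (month, month - 1, month + 1):
        if (state, candidate) in dists:
            return dists[(state, candidate)], candidate
    return None, None

# Made-up distributions keyed by (state, month), for illustration only.
dists = {
    ("S1", 6): {"0 - 17 years": 0.2, "18 to 49 years": 0.8},
    ("S1", 8): {"0 - 17 years": 0.3, "18 to 49 years": 0.7},
}
july, july_source = distribution_with_fallback(dists, "S1", 7)
sept, sept_source = distribution_with_fallback(dists, "S1", 9)
nov, nov_source = distribution_with_fallback(dists, "S1", 11)
```

Returning the source month alongside the distribution keeps a record of which rows rely on a borrowed distribution.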

Current Dataset

month  state  time_interval  case_tot  death_tot  age_group       prop       case_per_age
5      AK     2021-05-31     408       4          0 - 17 years    0.1770833  72
5      AK     2021-05-31     408       4          18 to 49 years  0.5911458  241
5      AK     2021-05-31     408       4          50 to 64 years  0.1901042  78
5      AK     2021-05-31     408       4          65+ years       0.0416667  17
6      AK     2021-06-14     409       1          0 - 17 years    0.1770833  72
6      AK     2021-06-14     409       1          18 to 49 years  0.5911458  242
6      AK     2021-06-14     409       1          50 to 64 years  0.1901042  78
6      AK     2021-06-14     409       1          65+ years       0.0416667  17
6      AK     2021-06-28     625       5          0 - 17 years    0.1770833  111
6      AK     2021-06-28     625       5          18 to 49 years  0.5911458  369
6      AK     2021-06-28     625       5          50 to 64 years  0.1901042  119
6      AK     2021-06-28     625       5          65+ years       0.0416667  26
7      AK     2021-07-12     2210      5          0 - 17 years    0.2067763  457
7      AK     2021-07-12     2210      5          18 to 49 years  0.5386148  1190
7      AK     2021-07-12     2210      5          50 to 64 years  0.1524664  337
7      AK     2021-07-12     2210      5          65+ years       0.1021425  226
7      AK     2021-07-26     3677      13         0 - 17 years    0.2067763  760
7      AK     2021-07-26     3677      13         18 to 49 years  0.5386148  1980
7      AK     2021-07-26     3677      13         50 to 64 years  0.1524664  561
7      AK     2021-07-26     3677      13         65+ years       0.1021425  376